Libraries used

These are the libraries used to complete the project:

library(ggplot2)
library(corrplot)
library(Stack)
library(fastDummies)
library(dplyr)
library(leaflet)
library(caret)
library(randomForest)

Import the data

The data is hosted on a gist, and we store it in two data.frame objects called train and test.

train <- read.csv("https://gist.githubusercontent.com/alombatti/f2cb8f1a244784999353309a97a6777f/raw/f38ee90641da97f62964ac10ea558564bfc6cb65/house_price_train.csv")
test <- read.csv("https://gist.githubusercontent.com/alombatti/1df87e336f159842e64ced9380b13e25/raw/49ea0fe6b42fb5d4868e1291dc8e47b20d2cc3bd/house_price_test.csv")

Data understanding

This dataset contains house sale prices for King County, which includes Seattle, Washington. It covers homes sold between May 2014 and May 2015. The dataset provided to us has already been divided into a train set of 17277 observations and a test set of 4320 observations.
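
The stated sizes can be checked directly after the import:

```r
dim(train)  # rows and columns of the train set; 17277 rows expected
dim(test)   # rows and columns of the test set; 4320 rows expected
```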

The fields describing the data are:

  1. id: a notation for a house
  2. date: the date the house was sold
  3. price: the prediction target, i.e. the amount in US Dollars the house was sold for
  4. bedrooms: the number of bedrooms in the house
  5. bathrooms: the number of bathrooms in the house
  6. sqft_living: square footage of the home
  7. sqft_lot: square footage of the lot
  8. floors: total floors (levels) in the house
  9. waterfront: whether the house has a view to the waterfront
  10. view: how many times the house has been viewed
  11. condition: overall condition of the house
  12. grade: overall grade given to the housing unit, based on King County grading system
  13. sqft_above: square footage of house apart from basement
  14. sqft_basement: square footage of the basement
  15. yr_built: the year in which the house was completed
  16. yr_renovated: year when house was renovated
  17. zipcode: ZIP code of the house
  18. lat: latitude coordinate
  19. long: longitude coordinate
  20. sqft_living15: living area in 2015 (if different from sqft_living, it implies some renovations)
  21. sqft_lot15: lot area in 2015 (if different from sqft_lot, it implies some renovations)

Here is a glimpse of what the STRUCTURE of the data looks like (for the train set)

str(train)
## 'data.frame':    17277 obs. of  21 variables:
##  $ id           : num  9.18e+09 4.64e+08 2.22e+09 6.16e+09 6.39e+09 ...
##  $ date         : Factor w/ 372 levels "1/10/2015","1/12/2015",..: 211 331 290 20 226 223 94 34 76 294 ...
##  $ price        : num  225000 641250 810000 330000 530000 ...
##  $ bedrooms     : int  3 3 4 4 4 4 4 3 4 3 ...
##  $ bathrooms    : num  1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
##  $ sqft_living  : int  1250 2220 3980 1890 1814 3120 4160 1440 2250 2540 ...
##  $ sqft_lot     : int  7500 2550 209523 7540 5000 5086 47480 10500 6840 9520 ...
##  $ floors       : num  1 3 2 1 1 2 2 1 2 1 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 2 2 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 4 4 3 3 3 3 3 ...
##  $ grade        : int  7 10 9 7 7 9 10 8 9 8 ...
##  $ sqft_above   : int  1250 2220 3980 1890 944 2480 4160 1130 2250 1500 ...
##  $ sqft_basement: int  0 0 0 0 870 640 0 310 0 1040 ...
##  $ yr_built     : int  1967 1990 2006 1967 1951 2008 1995 1983 1987 1959 ...
##  $ yr_renovated : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
##  $ lat          : num  47.4 47.7 47.6 47.8 47.7 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
##  $ sqft_lot15   : int  7563 5610 65775 8515 5000 5092 40428 8125 7386 6800 ...

We then check for NAs in the data we are going to use, both in the train and in the test.

print(length(which(is.na(train) == T)))
## [1] 0
print(length(which(is.na(test) == T)))
## [1] 0

As we can see, the data does not contain any null values, so we can proceed with our analysis.
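
A per-column count gives the same conclusion with more detail; both helpers below are base R:

```r
sum(is.na(train))      # total NA count across the whole data frame
colSums(is.na(train))  # NA count per column, to localize any missing values
```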

Transformation of the DATE field

In order to analyze the field DATE, and to determine whether it is relevant to our analysis, we first have to transform it into a usable form. I separate it into year, month, and day fields so that, later in the analysis, I can determine whether any of them is directly correlated with the price.

First, I change the field date from factor to date, and I store in a new dataframe the complete date field together with its year, month, and day. Then, I merge the newly created dataset with the one from which I extracted the date field, and I reorder the columns in a way that makes sense for the analysis.

I repeat the same procedure for the test and the train datasets.

train$date <- as.Date(train$date, "%m/%d/%Y")
df.date_train <- data.frame(date = train$date,
                      year = as.numeric(format(train$date, format = "%Y")),
                      month = as.numeric(format(train$date, format = "%m")),
                      day = as.numeric(format(train$date, format = "%d")))

train <- unique(merge(train, df.date_train))
rm(df.date_train)
train <- train[, c(2, 1, 22, 23, 24, 3:21)]
test$date <- as.Date(test$date, "%m/%d/%Y")
df.date_test <- data.frame(date = test$date,
                           year = as.numeric(format(test$date, format = "%Y")),
                           month = as.numeric(format(test$date, format = "%m")),
                           day = as.numeric(format(test$date, format = "%d")))

test <- unique(merge(test, df.date_test))
rm(df.date_test)
test <- test[, c(2, 1, 21, 22, 23, 3:20)]
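
The train and test blocks above are identical except for the dataset they act on, so the same logic could be wrapped in a small helper (a sketch; `add_date_parts` is a hypothetical name, not in the original code):

```r
add_date_parts <- function(df) {
  # convert the factor to Date, then derive numeric year/month/day columns
  df$date <- as.Date(df$date, "%m/%d/%Y")
  df$year  <- as.numeric(format(df$date, "%Y"))
  df$month <- as.numeric(format(df$date, "%m"))
  df$day   <- as.numeric(format(df$date, "%d"))
  df
}

# train <- add_date_parts(train)
# test  <- add_date_parts(test)
```

Adding the columns in place would also avoid the merge()/unique() round trip.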

Now I stack together the train and test datasets to proceed with a more accurate analysis. I also eliminate the field date, since its information now lives in the year, month, and day columns.

complete <- Stack(train, test)
complete$date <- NULL

The dataset now looks like this:

head(complete)
##           id year month day bedrooms bathrooms sqft_living sqft_lot floors
## 1 8673400086 2014     5   2        3      2.50        1270     1180    3.0
## 2 5561000190 2014     5   2        3      2.25        1970    35100    2.0
## 3 7575600100 2014     5   2        3      2.50        2090    10834    1.0
## 4 9267200226 2014     5   2        3      2.50        1770     1235    3.0
## 5 2025700730 2014     5   2        3      3.00        1850    19966    1.0
## 6 1868902745 2014     5   2        3      2.00        2710     4500    1.5
##   waterfront view condition grade sqft_above sqft_basement yr_built
## 1          0    0         3     8       1270             0     2001
## 2          0    0         4     9       1970             0     1977
## 3          0    0         4     8       1360           730     1987
## 4          0    0         3     8       1600           170     2007
## 5          0    0         4     7       1090           760     1992
## 6          0    0         4     8       1880           830     1929
##   yr_renovated zipcode     lat     long sqft_living15 sqft_lot15  price
## 1            0   98107 47.6697 -122.392          1320       1180 445700
## 2            0   98027 47.4635 -121.991          2340      35100 437500
## 3            0   98003 47.3537 -122.303          1750       8595 285000
## 4            0   98103 47.6965 -122.342          1680       1203 436110
## 5            0   98038 47.3493 -122.034          1410       6715 287200
## 6            0   98115 47.6747 -122.295          2060       4500 805000

Correlation matrix

Now I plot a correlation graph to see how strongly the different variables are correlated with our target variable PRICE.

corr <- cor(train[, 3:24])

corrplot(corr, method = "color",
         outline = T,
         cl.pos = "n",
         rect.col = "black",
         tl.col = "indianred4",
         addCoef.col = "black",
         number.digits = 2,
         number.cex = 0.60,
         tl.cex = 0.7,
         cl.cex = 1,
         col = colorRampPalette(c("red", "white", "green4")) (100))

From the matrix, we can see that many variables are correlated with price, some more strongly than others. I decided to proceed with my analysis keeping only the variables that have a correlation >= 0.15 with the target variable.

Therefore, going forward I will only keep:

  1. price
  2. bedrooms
  3. bathrooms
  4. sqft_living
  5. floors
  6. waterfront
  7. view
  8. grade
  9. sqft_above
  10. sqft_basement
  11. lat
  12. sqft_living15

Nonetheless, some considerations have to be made. The variables condition, yr_renovated, and zipcode will be converted to factors later in the analysis. Therefore, we still include them in the list of variables we keep: their numeric encoding cannot really be interpreted as a magnitude, and as numeric values they might not show their true relationship with price in the correlation matrix.

complete2 <- select(complete, "id", "price", "bedrooms", "bathrooms", "sqft_living", "floors",
                    "waterfront", "view", "grade", "sqft_above", "sqft_basement", "lat",
                    "sqft_living15", "condition", "yr_renovated", "zipcode")
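
The same shortlist could also be derived programmatically from the correlation matrix computed above, rather than read off the plot (a sketch; it assumes price is among the columns of corr and reuses the 0.15 cutoff):

```r
price_corr <- corr[, "price"]                       # correlation of each variable with price
keep <- names(price_corr)[abs(price_corr) >= 0.15]  # variables clearing the threshold
keep
```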

Data Visualization

This is an interactive map with all the houses based on their geographical position, coloured by their price.

complete$PriceBin <- cut(complete$price, c(0, 250e3, 500e3, 750e3, 1e6, 2e6, 999e6))

center_lon = median(complete$long, na.rm = TRUE)
center_lat = median(complete$lat, na.rm = TRUE)

factpal <- colorFactor(c("black", "blue", "yellow", "orange", "#0B5345", "red"),
                       complete$PriceBin)



leaflet(complete) %>% addProviderTiles("Esri.NatGeoWorldMap") %>%
  addCircles(lng = ~ long, lat = ~ lat, 
             color = ~ factpal(PriceBin))  %>%
  setView(lng = center_lon, lat = center_lat, zoom = 12) %>%
  
  addLegend("bottomright", pal = factpal, values = ~ PriceBin,
            title = "House Price Distribution",
            opacity = 1)
complete$PriceBin <- NULL

In this section we are going to analyze the relationship of all the variables with our target, the variable price. Before doing this, I create a copy of the dataset and split it into train and test, based on the value of the variable price.

train2 <- split(complete2, complete2$price > 0)
train2 <- train2[["TRUE"]]

test2 <- split(complete2, is.na(complete2$price))
test2 <- test2[["TRUE"]]

Distribution of prices

Here we look at the histogram showing how many houses have been sold at each price.

graph1 <- ggplot(data = train2, aes(x = price)) +    
  geom_histogram(alpha = 0.8, fill = "#F1C40F") +
  labs(x = "price", y = "houses", title = "Distribution of Prices") +
  theme_bw()

graph1

Since the price range of the houses is very wide, it is smart to focus on a smaller segment, to see the pattern where most of the houses are. In this histogram, we only see the distribution of houses sold at a price of up to 1.5 million US Dollars.

graph2 <- ggplot(data = train2, aes(x = price)) +    
  geom_histogram(alpha = 0.8, fill = "#F1C40F") +
  scale_x_continuous(limits=c(0, 1.5e6)) +
  labs(x = "price", y = "houses", title = "Distribution of Prices (up to 1.5 million)") +
  theme_bw()

graph2


For the sole purpose of data visualization, I now add a column to my train dataframe representing the logarithmic transformation of the price, so that the graphs will look cleaner and be easier to read.

train2$logprice <- log(train2$price)

Boxplot between price and bedrooms

graph3 <- boxplot(train2[, "logprice"] ~ train2[, "bedrooms"],
                  main = "Price vs Bedrooms", col=c("#F1C40F","#336633"),
                  xlab="bedrooms", ylab="log (price)")

From this graph, we notice two odd values: one house with 11 bedrooms and one with 33.

print(subset(train2, train2$bedrooms > 10))
##              id  price bedrooms bathrooms sqft_living floors waterfront
## 2866 2402100895 640000       33      1.75        1620      1          0
## 6003 1773100755 520000       11      3.00        3000      2          0
##      view grade sqft_above sqft_basement     lat sqft_living15 condition
## 2866    0     7       1040           580 47.6878          1330         5
## 6003    0     7       2400           600 47.5560          1420         3
##      yr_renovated zipcode logprice
## 2866            0   98103 13.36922
## 6003         1999   98106 13.16158

In fact, we can see that only two houses have more than 10 bedrooms, so we exclude these records from our analysis so they do not distort our model.

Boxplot between price and bathrooms

graph4 <- boxplot(train2[, "logprice"] ~ train2[, "bathrooms"],
                  main = "Price vs Bathrooms", col=c("#F1C40F","#336633"),
                  xlab="bathrooms", ylab="log (price)")

The relationship looks linear overall. Nonetheless, the price of the house with 7.5 bathrooms is noticeably lower than its neighbours'. Therefore, we decide to eliminate that record.

Boxplot between price and floors

graph5 <- boxplot(train2[, "logprice"] ~ train2[, "floors"],
                  main = "Price vs Floors", col=c("#F1C40F","#336633"),
                  xlab="floors", ylab="log (price)")

The relationship is linear.

Boxplot between price and waterfront

graph6 <- boxplot(train2[, "logprice"] ~ train2[, "waterfront"],
                  main = "Price vs Waterfront", col=c("#F1C40F","#336633"),
                  xlab="waterfront", ylab="log (price)")

The relationship is linear. Here, 0 represents the houses without a waterfront, while 1 represents the houses with a waterfront. It is common sense that, if the house has a waterfront, its price is higher.

Boxplot between price and view

graph7 <- boxplot(train2[, "logprice"] ~ train2[, "view"],
                  main = "Price vs View", col=c("#F1C40F","#336633"),
                  xlab="view", ylab="log (price)")

The relationship is linear. Here, 0 represents the worst view, while 4 represents the best view. It is common sense that the more beautiful the view, the higher the price of the house.

Boxplot between price and condition

graph8 <- boxplot(train2[, "logprice"] ~ train2[, "condition"],
                  main = "Price vs Condition", col=c("#F1C40F","#336633"),
                  xlab="condition", ylab="log (price)")

The relationship is linear.

Boxplot between price and grade

graph9 <- boxplot(train2[, "logprice"] ~ train2[, "grade"],
                  main = "Price vs Grade", col=c("#F1C40F","#336633"),
                  xlab="grade", ylab="log (price)")

The relationship is linear.

Boxplot between price and sqft_living

graph10 <- boxplot(train2[, "logprice"] ~ train2[, "sqft_living"],
                  main = "Price vs sqft_Living", col=c("#F1C40F","#336633"),
                  xlab="sqft_living", ylab="log (price)")

The relationship looks linear. The bigger the living space, the higher the price of the house.

Boxplot between price and sqft_basement

graph11 <- boxplot(train2[, "logprice"] ~ train2[, "sqft_basement"],
                   main = "Price vs sqft_Basement", col=c("#F1C40F","#336633"),
                   xlab="sqft_basement", ylab="log (price)")

This graph is interesting. We can see that a lot of houses do not have a basement at all, while for the houses that do have one, the price does not change much even as the size of the basement varies considerably.

length(train2$sqft_basement[train2$sqft_basement == 0])
## [1] 10521
length(train2$sqft_basement[train2$sqft_basement != 0])
## [1] 6756

Therefore, I decided to simplify this field into a boolean: it takes value 0 if the house does not have a basement, and value 1 if it has one, regardless of its size.

Boxplot between price and yr_renovated

graph12 <- boxplot(train2[, "logprice"] ~ train2[, "yr_renovated"],
                   main = "Price vs yr_Renovated", col=c("#F1C40F","#336633"),
                   xlab="yr_renovated", ylab="log (price)")

This graph is very similar to the previous one for the basement, and the same reasoning applies to renovations: it does not really matter when the house was renovated; what matters is whether it has been renovated at all.

length(train2$yr_renovated[train2$yr_renovated == 0])
## [1] 16538
length(train2$yr_renovated[train2$yr_renovated != 0])
## [1] 739

As we can see, most of the houses have not been renovated, so we will transform this field into a boolean as well.

Boxplot between price and zipcode

graph13 <- boxplot(train2[, "logprice"] ~ train2[, "zipcode"],
                   main = "Price vs Zipcode", col=c("#F1C40F","#336633"),
                   xlab="zipcode", ylab="log (price)")

We cannot really see a pattern in this graph, so we will categorize the variable when modelling (as anticipated earlier).

Boxplot between price and sqft_living15

graph14 <- boxplot(train2[, "logprice"] ~ train2[, "sqft_living15"],
                   main = "Price vs sqft_Living(15)", col=c("#F1C40F","#336633"),
                   xlab="sqft_living(15)", ylab="log (price)")

As with the variable sqft_living, we see a direct relationship in this graph: as the living area increases, the price of the house increases as well.




Given all the insights from the data visualization section, we can proceed to apply the changes to our dataset.

First, as seen in the graph for the bathrooms, we eliminate the house with 7.5 bathrooms because its price was odd and was probably a mistake.

train2 <- train2[-5419, ]

Then, as seen in the graph for the bedrooms, we only keep those houses with 10 or fewer bedrooms, since the two houses with 11 and 33 bedrooms appear to have erroneous price values.

train2 <- train2[train2$bedrooms <= 10, ]

We dummify the sqft_basement field, assigning value 1 to all houses with a basement and keeping value 0 for all houses without one.

train2$sqft_basement[train2$sqft_basement != 0] <- 1
test2$sqft_basement[test2$sqft_basement != 0] <- 1

We dummify the yr_renovated field, assigning value 1 to all houses that have been renovated and keeping value 0 for all houses without a renovation.

train2$yr_renovated[train2$yr_renovated != 0] <- 1
test2$yr_renovated[test2$yr_renovated != 0] <- 1



Lastly, we stack back together the train and test set.

train2$logprice <- NULL
complete2 <- Stack(train2, test2)

Data Transformation

I create a copy of the complete dataset to proceed with the analysis.

complete3 <- complete2

Now, it is time to rescale the variables involving square feet. I create a function that normalizes the values of a field between 0 and 1, 0 being the lowest value and 1 being the highest.

range01 <- function(x) {
  (x-min(x)) / (max(x)-min(x))
}
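
A quick sanity check on a toy vector shows the intended behaviour:

```r
range01(c(2, 4, 6, 10))
## [1] 0.00 0.25 0.50 1.00
```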

I apply range01 to sqft_living, while for sqft_above and sqft_living15 I use scale, which standardizes to zero mean and unit variance.

complete3$sqft_living <- range01(complete3$sqft_living)
complete3$sqft_above <- scale(complete3$sqft_above)
complete3$sqft_living15 <- scale(complete3$sqft_living15)

In order to proceed with the design of the model, I transform the following variables into factors because they do not make sense as numeric in this context:

  1. bedrooms
  2. bathrooms
  3. floors
  4. waterfront
  5. view
  6. grade
  7. sqft_basement
  8. yr_renovated
  9. condition
  10. zipcode

complete3$bedrooms <- as.factor(complete3$bedrooms)
complete3$bathrooms <- as.factor(complete3$bathrooms)
complete3$floors <- as.factor(complete3$floors)
complete3$waterfront <- as.factor(complete3$waterfront)
complete3$view <- as.factor(complete3$view)
complete3$grade <- as.factor(complete3$grade)
complete3$sqft_basement <- as.factor(complete3$sqft_basement)
complete3$yr_renovated <- as.factor(complete3$yr_renovated)
complete3$condition <- as.factor(complete3$condition)
complete3$zipcode <- as.factor(complete3$zipcode)

Another variable that does not make sense as numeric is lat. Nonetheless, we cannot treat it as a factor either, because it takes too many distinct values.

Therefore, we bin the latitude into 13 intervals.

breaks_lat <- c(-Inf, 47.20, 47.25, 47.30, 47.35, 47.40, 47.45, 47.50, 47.55, 47.60,
                47.65, 47.70, 47.75, Inf)

names_lat <- c("47.15 - 47.20", "47.20 - 47.25", "47.25 - 47.30", "47.30 - 47.35",
               "47.35 - 47.40", "47.40 - 47.45", "47.45 - 47.50", "47.50 - 47.55",
               "47.55 - 47.60", "47.60 - 47.65", "47.65 - 47.70", "47.70 - 47.75",
               "47.75 - 47.80")

complete3$lat <- cut(complete3$lat, breaks = breaks_lat, labels = names_lat)

Finally, I unstack the train and test datasets from the complete dataset to build our model.

train3 <- split(complete3, complete3$price > 0)
train3 <- train3[["TRUE"]]

test3 <- split(complete3, is.na(complete3$price))
test3 <- test3[["TRUE"]]
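
The split()/[["TRUE"]] idiom works because the train rows carry a positive price while the test rows have NA price after stacking; under that assumption, the same unstack can be written more directly with subset() (a sketch):

```r
train3_alt <- subset(complete3, !is.na(price))  # rows that came from the train set
test3_alt  <- subset(complete3, is.na(price))   # rows that came from the test set
```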

Modelling

(1) Linear regression

I decided to use a linear regression model.

First, I create a dataset with all the variables I am going to need in my model.

train4 <- train3[, c("bedrooms", "bathrooms", "sqft_living", "floors", "waterfront", "view",
                     "grade", "sqft_above", "sqft_living15", "lat", "condition",
                     "yr_renovated", "zipcode", "price")]
test4 <- test3

Then, I set up the settings for my model:

formula <- price ~ .

fitControl <- trainControl(method = "cv", number = 5)

And last, I run my model.

model1 <- train(formula,
                data = train4,
                method = "lm",
                trControl = fitControl,
                metric = "MAE")

The model scores an R2 of 0.84, which is a relatively good result.

summary(model1)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2190582   -59141     1648    54520  3583476 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         113830.4   175708.9   0.648 0.517100    
## bedrooms2            18314.4    12350.5   1.483 0.138124    
## bedrooms3            26826.7    12385.8   2.166 0.030330 *  
## bedrooms4            10704.3    12674.0   0.845 0.398351    
## bedrooms5             3461.4    13376.7   0.259 0.795819    
## bedrooms6           -41118.7    16297.5  -2.523 0.011645 *  
## bedrooms7          -174762.9    30624.6  -5.707 1.17e-08 ***
## bedrooms8           -27843.7    48510.5  -0.574 0.565993    
## bedrooms9          -215650.8    85973.6  -2.508 0.012139 *  
## bedrooms10         -306878.0    85936.7  -3.571 0.000357 ***
## bathrooms0.75         2854.0    85907.9   0.033 0.973498    
## bathrooms1           39490.9    83609.0   0.472 0.636699    
## bathrooms1.25       -21275.3    98048.7  -0.217 0.828221    
## bathrooms1.5         30505.9    83755.3   0.364 0.715693    
## bathrooms1.75        29472.3    83716.6   0.352 0.724806    
## bathrooms2           29266.7    83743.6   0.349 0.726734    
## bathrooms2.25        47229.1    83782.8   0.564 0.572960    
## bathrooms2.5         36145.9    83765.9   0.432 0.666102    
## bathrooms2.75        33780.6    83870.1   0.403 0.687120    
## bathrooms3           48649.3    83947.3   0.580 0.562244    
## bathrooms3.25       101589.6    84065.0   1.208 0.226885    
## bathrooms3.5         64331.5    84029.3   0.766 0.443935    
## bathrooms3.75       115765.8    84934.2   1.363 0.172898    
## bathrooms4          121190.7    85050.8   1.425 0.154198    
## bathrooms4.25       218580.3    86014.1   2.541 0.011056 *  
## bathrooms4.5        175160.7    85573.8   2.047 0.040684 *  
## bathrooms4.75       458082.7    90560.2   5.058 4.27e-07 ***
## bathrooms5          145126.9    92667.0   1.566 0.117341    
## bathrooms5.25       358749.2    95235.7   3.767 0.000166 ***
## bathrooms5.5        809693.9   104724.2   7.732 1.12e-14 ***
## bathrooms5.75       -72486.5   119719.1  -0.605 0.544874    
## bathrooms6          452155.2   112699.2   4.012 6.04e-05 ***
## bathrooms6.25       615522.3   135355.4   4.547 5.47e-06 ***
## bathrooms6.5        134140.4   133142.7   1.007 0.313712    
## bathrooms6.75      1341142.1   169579.1   7.909 2.76e-15 ***
## bathrooms7.75      2737817.5   177664.8  15.410  < 2e-16 ***
## bathrooms8         1696080.5   138313.6  12.263  < 2e-16 ***
## sqft_living        1412948.9    48020.2  29.424  < 2e-16 ***
## floors1.5            -6775.6     4422.9  -1.532 0.125557    
## floors2             -29217.1     3833.9  -7.621 2.65e-14 ***
## floors2.5            82601.3    13867.5   5.956 2.63e-09 ***
## floors3             -89419.3     7922.3 -11.287  < 2e-16 ***
## floors3.5           126311.6    73590.5   1.716 0.086105 .  
## waterfront1         603760.2    16366.7  36.890  < 2e-16 ***
## view1                87146.1     9293.6   9.377  < 2e-16 ***
## view2                68955.2     5658.1  12.187  < 2e-16 ***
## view3               153303.9     7663.6  20.004  < 2e-16 ***
## view4               272209.2    11889.8  22.894  < 2e-16 ***
## grade4             -164197.8   149366.7  -1.099 0.271656    
## grade5             -202224.7   146923.0  -1.376 0.168716    
## grade6             -211707.9   146778.5  -1.442 0.149218    
## grade7             -202115.4   146818.0  -1.377 0.168642    
## grade8             -172827.4   146859.7  -1.177 0.239284    
## grade9              -94414.4   146928.3  -0.643 0.520500    
## grade10              22656.2   147038.1   0.154 0.877545    
## grade11             197241.4   147290.3   1.339 0.180545    
## grade12             467112.1   148327.2   3.149 0.001640 ** 
## grade13            1155883.0   155807.2   7.419 1.24e-13 ***
## sqft_above           40930.0     3181.3  12.866  < 2e-16 ***
## sqft_living15        13026.7     2031.9   6.411 1.48e-10 ***
## `lat47.20 - 47.25`    7622.9    22796.7   0.334 0.738092    
## `lat47.25 - 47.30`   56801.8    36727.6   1.547 0.121986    
## `lat47.30 - 47.35`   47898.8    37606.8   1.274 0.202797    
## `lat47.35 - 47.40`   53471.7    38931.0   1.374 0.169615    
## `lat47.40 - 47.45`   97021.6    40068.0   2.421 0.015470 *  
## `lat47.45 - 47.50`   93070.2    40942.4   2.273 0.023027 *  
## `lat47.50 - 47.55`  120873.1    41577.6   2.907 0.003652 ** 
## `lat47.55 - 47.60`  169214.8    42027.7   4.026 5.69e-05 ***
## `lat47.60 - 47.65`  247363.5    43406.5   5.699 1.23e-08 ***
## `lat47.65 - 47.70`  230068.5    44366.6   5.186 2.18e-07 ***
## `lat47.70 - 47.75`  215981.8    45133.4   4.785 1.72e-06 ***
## `lat47.75 - 47.80`  189948.4    45685.8   4.158 3.23e-05 ***
## condition2           75954.1    33519.9   2.266 0.023468 *  
## condition3           84927.3    31265.9   2.716 0.006608 ** 
## condition4          110606.6    31289.9   3.535 0.000409 ***
## condition5          162552.9    31473.7   5.165 2.44e-07 ***
## yr_renovated1        64266.3     5621.8  11.432  < 2e-16 ***
## zipcode98002           387.4    14042.9   0.028 0.977990    
## zipcode98003         -4009.5    12776.3  -0.314 0.753658    
## zipcode98004        581026.8    23743.7  24.471  < 2e-16 ***
## zipcode98005        153577.5    24632.8   6.235 4.63e-10 ***
## zipcode98006        143975.4    21462.7   6.708 2.03e-11 ***
## zipcode98007         84513.3    25535.3   3.310 0.000936 ***
## zipcode98008         83360.1    24384.1   3.419 0.000631 ***
## zipcode98010         54937.0    18201.7   3.018 0.002546 ** 
## zipcode98011        -13224.5    28974.8  -0.456 0.648100    
## zipcode98014        -75293.9    28097.7  -2.680 0.007376 ** 
## zipcode98019        -61627.4    28894.3  -2.133 0.032951 *  
## zipcode98022         34072.6    33717.7   1.011 0.312258    
## zipcode98023        -33498.2    11182.9  -2.995 0.002744 ** 
## zipcode98024         56962.3    26493.6   2.150 0.031566 *  
## zipcode98027         94804.6    20680.7   4.584 4.59e-06 ***
## zipcode98028        -17205.6    28245.4  -0.609 0.542435    
## zipcode98029        108626.4    22223.2   4.888 1.03e-06 ***
## zipcode98030          5551.5    16215.0   0.342 0.732078    
## zipcode98031        -14411.7    17173.9  -0.839 0.401391    
## zipcode98032         -1255.8    19548.1  -0.064 0.948780    
## zipcode98033        179587.3    25983.4   6.912 4.96e-12 ***
## zipcode98034         36171.6    27121.4   1.334 0.182322    
## zipcode98038         31301.1    14011.4   2.234 0.025498 *  
## zipcode98039        928698.7    34264.2  27.104  < 2e-16 ***
## zipcode98040        381226.3    22303.7  17.093  < 2e-16 ***
## zipcode98042         -2831.4    13661.1  -0.207 0.835810    
## zipcode98045         54456.0    20428.5   2.666 0.007690 ** 
## zipcode98052         61047.3    25377.4   2.406 0.016157 *  
## zipcode98053         26245.3    25754.5   1.019 0.308191    
## zipcode98055         -2750.3    19430.3  -0.142 0.887440    
## zipcode98056         27545.6    20465.8   1.346 0.178342    
## zipcode98058        -11956.6    18124.7  -0.660 0.509465    
## zipcode98059         31291.6    19913.6   1.571 0.116117    
## zipcode98065         22391.8    21723.1   1.031 0.302656    
## zipcode98070         -3765.3    20653.9  -0.182 0.855345    
## zipcode98072          9780.9    28235.2   0.346 0.729041    
## zipcode98074        -19471.8    24381.0  -0.799 0.424507    
## zipcode98075         38717.0    22365.8   1.731 0.083455 .  
## zipcode98077        -26557.8    28837.5  -0.921 0.357091    
## zipcode98092        -21326.8    11968.5  -1.782 0.074781 .  
## zipcode98102        301006.6    28451.8  10.580  < 2e-16 ***
## zipcode98103        169097.9    25816.1   6.550 5.91e-11 ***
## zipcode98105        306119.8    27120.5  11.287  < 2e-16 ***
## zipcode98106         50726.4    21616.5   2.347 0.018954 *  
## zipcode98107        182195.5    26988.5   6.751 1.52e-11 ***
## zipcode98108         27535.9    23297.5   1.182 0.237252    
## zipcode98109        344538.9    28018.8  12.297  < 2e-16 ***
## zipcode98112        425357.6    25216.5  16.868  < 2e-16 ***
## zipcode98115        164599.3    25870.1   6.363 2.03e-10 ***
## zipcode98116        178734.3    22526.9   7.934 2.25e-15 ***
## zipcode98117        152520.8    25918.5   5.885 4.06e-09 ***
## zipcode98118         73299.3    21040.6   3.484 0.000496 ***
## zipcode98119        296744.3    26359.0  11.258  < 2e-16 ***
## zipcode98122        145489.5    25202.7   5.773 7.93e-09 ***
## zipcode98125         38354.3    27440.9   1.398 0.162219    
## zipcode98126        109243.6    21510.9   5.079 3.84e-07 ***
## zipcode98133          6557.5    27385.3   0.239 0.810756    
## zipcode98136        178382.9    22213.9   8.030 1.04e-15 ***
## zipcode98144        162303.9    22428.3   7.237 4.80e-13 ***
## zipcode98146         29240.0    21014.5   1.391 0.164116    
## zipcode98148         17989.5    25969.6   0.693 0.488499    
## zipcode98155         -9493.0    27686.1  -0.343 0.731693    
## zipcode98166         27681.2    19918.9   1.390 0.164640    
## zipcode98168          2693.5    20735.3   0.130 0.896649    
## zipcode98177         78350.1    28406.9   2.758 0.005819 ** 
## zipcode98178        -23646.2    20999.4  -1.126 0.260163    
## zipcode98188        -14923.4    21528.2  -0.693 0.488192    
## zipcode98198        -20883.2    16449.2  -1.270 0.204261    
## zipcode98199        209708.8    25175.0   8.330  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 144300 on 17128 degrees of freedom
## Multiple R-squared:  0.8463, Adjusted R-squared:  0.845 
## F-statistic: 650.6 on 145 and 17128 DF,  p-value: < 2.2e-16

Then, we take the model's fitted values as predicted prices for every house in the train dataset, to see how well they reflect the actual values.

train4_fit <- fitted(model1)
train4$pred <- train4_fit

We score our model on the MAPE, the mean absolute percentage error. The lower this number, the better the model.

mape <- function(real, predicted) {
  return(mean(abs((real - predicted) / real)))
}
mape(train4$price, train4$pred)
## [1] 0.1668104

We get a MAPE score of 0.167, which is a good result for our model.

(2) Random Forest

Now I try a random forest to see if my score improves.

I fit a model with 500 trees.

train5 <- train3[, c("bedrooms", "bathrooms", "sqft_living", "floors", "waterfront", "view",
                     "grade", "sqft_above", "sqft_living15", "lat", "condition",
                     "yr_renovated", "price")]
test5 <- test3
model2 <- randomForest(formula,
                       data = train5)
model2
## 
## Call:
##  randomForest(formula = formula, data = train5) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 26404588065
##                     % Var explained: 80.35

The model explains 80.35% of the variance (an R2 of 0.8035), which does not beat the linear model I ran before in (1), whose R2 was 0.8463.
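A possible follow-up, sketched here but not run for this project, would be to tune mtry, the number of variables tried at each split, and compare the out-of-bag fit for each candidate value. The loop below uses the built-in mtcars data so it is self-contained.

```r
library(randomForest)

set.seed(42)
# compare out-of-bag pseudo R-squared for a few candidate mtry values
for (m in c(2, 4, 6)) {
  rf <- randomForest(mpg ~ ., data = mtcars, ntree = 500, mtry = m)
  cat("mtry =", m, "-> OOB % var explained:", round(tail(rf$rsq, 1) * 100, 2), "\n")
}
```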

(3) Linear regression with LOG

Now I try a linear regression on the log of prices.

First, I create a dataset with all the variables I am going to need in my model.

train6 <- train3[, c("bedrooms", "bathrooms", "sqft_living", "floors", "waterfront", "view",
                     "grade", "sqft_above", "sqft_living15", "lat", "condition",
                     "yr_renovated", "zipcode", "price")]
test6 <- test3

Then, I log-transform the price variable in both sets

train6$price <- log(train6$price)
test6$price <- log(test6$price)
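A quick toy illustration of why the log helps here (toy prices, not the project data): house prices are right-skewed, and the log compresses the large values so a handful of very expensive houses no longer dominates the fit.

```r
prices <- c(100000, 250000, 500000, 2000000)  # toy right-skewed prices
diff(range(prices))       # raw spread: 1900000 dollars
diff(range(log(prices)))  # log spread: about 3 log units
```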

And last, I run my model.

model3 <- train(formula,
                data = train6,
                method = "lm",
                trControl = fitControl,
                metric = "MAE")

This model turns out to be the best, with an R2 of 0.8839.

summary(model3)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.26579 -0.09984  0.00356  0.10231  1.12159 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        11.808937   0.220018  53.673  < 2e-16 ***
## bedrooms2           0.041196   0.015465   2.664 0.007733 ** 
## bedrooms3           0.052312   0.015509   3.373 0.000745 ***
## bedrooms4           0.049575   0.015870   3.124 0.001788 ** 
## bedrooms5           0.028251   0.016750   1.687 0.091695 .  
## bedrooms6          -0.003276   0.020407  -0.161 0.872457    
## bedrooms7          -0.128867   0.038347  -3.361 0.000780 ***
## bedrooms8          -0.081525   0.060744  -1.342 0.179575    
## bedrooms9          -0.051054   0.107654  -0.474 0.635334    
## bedrooms10         -0.168775   0.107608  -1.568 0.116800    
## bathrooms0.75       0.252896   0.107572   2.351 0.018737 *  
## bathrooms1          0.285058   0.104693   2.723 0.006480 ** 
## bathrooms1.25       0.307928   0.122774   2.508 0.012148 *  
## bathrooms1.5        0.300974   0.104876   2.870 0.004112 ** 
## bathrooms1.75       0.335163   0.104828   3.197 0.001390 ** 
## bathrooms2          0.330657   0.104862   3.153 0.001617 ** 
## bathrooms2.25       0.358195   0.104911   3.414 0.000641 ***
## bathrooms2.5        0.361940   0.104889   3.451 0.000561 ***
## bathrooms2.75       0.366007   0.105020   3.485 0.000493 ***
## bathrooms3          0.360288   0.105117   3.428 0.000611 ***
## bathrooms3.25       0.380001   0.105264   3.610 0.000307 ***
## bathrooms3.5        0.379947   0.105219   3.611 0.000306 ***
## bathrooms3.75       0.388190   0.106352   3.650 0.000263 ***
## bathrooms4          0.358334   0.106498   3.365 0.000768 ***
## bathrooms4.25       0.384542   0.107705   3.570 0.000357 ***
## bathrooms4.5        0.350517   0.107153   3.271 0.001073 ** 
## bathrooms4.75       0.357969   0.113397   3.157 0.001598 ** 
## bathrooms5          0.340433   0.116035   2.934 0.003352 ** 
## bathrooms5.25       0.334061   0.119252   2.801 0.005095 ** 
## bathrooms5.5        0.326108   0.131133   2.487 0.012897 *  
## bathrooms5.75       0.147547   0.149909   0.984 0.325009    
## bathrooms6          0.236772   0.141119   1.678 0.093400 .  
## bathrooms6.25       0.133357   0.169489   0.787 0.431397    
## bathrooms6.5        0.372262   0.166718   2.233 0.025569 *  
## bathrooms6.75      -0.266693   0.212343  -1.256 0.209149    
## bathrooms7.75      -0.225408   0.222467  -1.013 0.310970    
## bathrooms8         -0.524727   0.173193  -3.030 0.002451 ** 
## sqft_living         1.875299   0.060130  31.188  < 2e-16 ***
## floors1.5           0.017854   0.005538   3.224 0.001267 ** 
## floors2            -0.049856   0.004801 -10.385  < 2e-16 ***
## floors2.5          -0.018280   0.017365  -1.053 0.292495    
## floors3            -0.156986   0.009920 -15.825  < 2e-16 ***
## floors3.5          -0.175153   0.092148  -1.901 0.057348 .  
## waterfront1         0.456475   0.020494  22.274  < 2e-16 ***
## view1               0.114449   0.011637   9.835  < 2e-16 ***
## view2               0.104969   0.007085  14.816  < 2e-16 ***
## view3               0.180525   0.009596  18.812  < 2e-16 ***
## view4               0.283162   0.014888  19.019  < 2e-16 ***
## grade4             -0.642328   0.187033  -3.434 0.000595 ***
## grade5             -0.504105   0.183973  -2.740 0.006148 ** 
## grade6             -0.406596   0.183792  -2.212 0.026962 *  
## grade7             -0.307781   0.183842  -1.674 0.094117 .  
## grade8             -0.218095   0.183894  -1.186 0.235645    
## grade9             -0.117054   0.183980  -0.636 0.524633    
## grade10            -0.064642   0.184117  -0.351 0.725523    
## grade11            -0.030581   0.184433  -0.166 0.868309    
## grade12             0.023109   0.185732   0.124 0.900983    
## grade13             0.176309   0.195098   0.904 0.366170    
## sqft_above          0.071466   0.003984  17.940  < 2e-16 ***
## sqft_living15       0.052324   0.002544  20.565  < 2e-16 ***
## `lat47.20 - 47.25`  0.039729   0.028545   1.392 0.164010    
## `lat47.25 - 47.30`  0.165263   0.045989   3.594 0.000327 ***
## `lat47.30 - 47.35`  0.164443   0.047090   3.492 0.000480 ***
## `lat47.35 - 47.40`  0.180708   0.048748   3.707 0.000210 ***
## `lat47.40 - 47.45`  0.267589   0.050172   5.333 9.76e-08 ***
## `lat47.45 - 47.50`  0.245121   0.051267   4.781 1.76e-06 ***
## `lat47.50 - 47.55`  0.350469   0.052062   6.732 1.73e-11 ***
## `lat47.55 - 47.60`  0.458231   0.052626   8.707  < 2e-16 ***
## `lat47.60 - 47.65`  0.509781   0.054353   9.379  < 2e-16 ***
## `lat47.65 - 47.70`  0.505051   0.055555   9.091  < 2e-16 ***
## `lat47.70 - 47.75`  0.474429   0.056515   8.395  < 2e-16 ***
## `lat47.75 - 47.80`  0.420774   0.057207   7.355 1.99e-13 ***
## condition2          0.085037   0.041973   2.026 0.042780 *  
## condition3          0.218205   0.039150   5.573 2.53e-08 ***
## condition4          0.268080   0.039180   6.842 8.06e-12 ***
## condition5          0.329011   0.039411   8.348  < 2e-16 ***
## yr_renovated1       0.082993   0.007039  11.790  < 2e-16 ***
## zipcode98002       -0.039612   0.017584  -2.253 0.024291 *  
## zipcode98003        0.002160   0.015998   0.135 0.892619    
## zipcode98004        0.761151   0.029731  25.601  < 2e-16 ***
## zipcode98005        0.394195   0.030845  12.780  < 2e-16 ***
## zipcode98006        0.348343   0.026875  12.962  < 2e-16 ***
## zipcode98007        0.307884   0.031975   9.629  < 2e-16 ***
## zipcode98008        0.300624   0.030533   9.846  < 2e-16 ***
## zipcode98010        0.240369   0.022792  10.546  < 2e-16 ***
## zipcode98011        0.154543   0.036282   4.260 2.06e-05 ***
## zipcode98014        0.005783   0.035183   0.164 0.869450    
## zipcode98019        0.026112   0.036181   0.722 0.470489    
## zipcode98022        0.183082   0.042220   4.336 1.46e-05 ***
## zipcode98023       -0.048464   0.014003  -3.461 0.000539 ***
## zipcode98024        0.205739   0.033175   6.202 5.71e-10 ***
## zipcode98027        0.324987   0.025896  12.550  < 2e-16 ***
## zipcode98028        0.125683   0.035368   3.554 0.000381 ***
## zipcode98029        0.305300   0.027827  10.971  < 2e-16 ***
## zipcode98030        0.024725   0.020304   1.218 0.223342    
## zipcode98031       -0.015075   0.021505  -0.701 0.483313    
## zipcode98032       -0.053609   0.024478  -2.190 0.028530 *  
## zipcode98033        0.424382   0.032536  13.044  < 2e-16 ***
## zipcode98034        0.213127   0.033961   6.276 3.56e-10 ***
## zipcode98038        0.130213   0.017545   7.422 1.21e-13 ***
## zipcode98039        0.866585   0.042905  20.198  < 2e-16 ***
## zipcode98040        0.584275   0.027928  20.921  < 2e-16 ***
## zipcode98042        0.028910   0.017106   1.690 0.091036 .  
## zipcode98045        0.250503   0.025580   9.793  < 2e-16 ***
## zipcode98052        0.281106   0.031777   8.846  < 2e-16 ***
## zipcode98053        0.232441   0.032249   7.208 5.93e-13 ***
## zipcode98055        0.031107   0.024330   1.279 0.201071    
## zipcode98056        0.151339   0.025627   5.906 3.58e-09 ***
## zipcode98058        0.055738   0.022695   2.456 0.014061 *  
## zipcode98059        0.207716   0.024935   8.330  < 2e-16 ***
## zipcode98065        0.181036   0.027201   6.655 2.91e-11 ***
## zipcode98070        0.314529   0.025862  12.162  < 2e-16 ***
## zipcode98072        0.197403   0.035355   5.583 2.39e-08 ***
## zipcode98074        0.184361   0.030529   6.039 1.58e-09 ***
## zipcode98075        0.227151   0.028006   8.111 5.36e-16 ***
## zipcode98077        0.170110   0.036110   4.711 2.49e-06 ***
## zipcode98092        0.018911   0.014987   1.262 0.207011    
## zipcode98102        0.605860   0.035627  17.006  < 2e-16 ***
## zipcode98103        0.473939   0.032326  14.661  < 2e-16 ***
## zipcode98105        0.590728   0.033960  17.395  < 2e-16 ***
## zipcode98106        0.135825   0.027068   5.018 5.27e-07 ***
## zipcode98107        0.500034   0.033794  14.796  < 2e-16 ***
## zipcode98108        0.107507   0.029173   3.685 0.000229 ***
## zipcode98109        0.652427   0.035084  18.596  < 2e-16 ***
## zipcode98112        0.680156   0.031575  21.541  < 2e-16 ***
## zipcode98115        0.463131   0.032394  14.297  < 2e-16 ***
## zipcode98116        0.447828   0.028208  15.876  < 2e-16 ***
## zipcode98117        0.462297   0.032454  14.244  < 2e-16 ***
## zipcode98118        0.220326   0.026346   8.363  < 2e-16 ***
## zipcode98119        0.624008   0.033006  18.906  < 2e-16 ***
## zipcode98122        0.440330   0.031558  13.953  < 2e-16 ***
## zipcode98125        0.248613   0.034361   7.235 4.84e-13 ***
## zipcode98126        0.311791   0.026935  11.575  < 2e-16 ***
## zipcode98133        0.154186   0.034291   4.496 6.96e-06 ***
## zipcode98136        0.470363   0.027816  16.910  < 2e-16 ***
## zipcode98144        0.343314   0.028084  12.224  < 2e-16 ***
## zipcode98146        0.117542   0.026314   4.467 7.99e-06 ***
## zipcode98148        0.052616   0.032519   1.618 0.105672    
## zipcode98155        0.130782   0.034668   3.772 0.000162 ***
## zipcode98166        0.219213   0.024942   8.789  < 2e-16 ***
## zipcode98168       -0.030499   0.025964  -1.175 0.240150    
## zipcode98177        0.301914   0.035570   8.488  < 2e-16 ***
## zipcode98178       -0.006055   0.026295  -0.230 0.817897    
## zipcode98188        0.006968   0.026957   0.259 0.796023    
## zipcode98198        0.014152   0.020597   0.687 0.492052    
## zipcode98199        0.502300   0.031523  15.934  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1807 on 17128 degrees of freedom
## Multiple R-squared:  0.8839, Adjusted R-squared:  0.8829 
## F-statistic: 899.4 on 145 and 17128 DF,  p-value: < 2.2e-16

Then, we extract the model's fitted values, i.e. the predicted price of every house in the train dataset, to see how closely the predictions reflect the actual values.

train6_fit <- fitted(model3)
train6$pred <- train6_fit

Now we transform the prices back to the original dollar scale with the exponential function, the inverse of the log.

train6$price <- exp(train6$price)
train6$pred <- exp(train6$pred)

We score this model on the MAPE as well. The lower this number, the better the model.

mape <- function(real, predicted) {
  return(mean(abs((real - predicted) / real)))
}
mape(train6$price, train6$pred)
## [1] 0.1348133

We get a MAPE score of 0.135, a clear improvement over the 0.167 of the plain linear model.

Predict on the test set and export it

The third model is the best we got so far, with an R2 of 0.88 and a MAPE of 0.135, so we decide to deploy this one.

Now we predict the log prices on the test dataset and transform them back to dollars with the exponential.

test6$price <- predict(model3, test6)
test6$price <- exp(test6$price)

Then I reduce my test data frame to just the id and the predicted price.

submission <- test6[, c("id", "price")]

Last, I export my data frame as a CSV file.

write.csv(submission, "Alberto_Lombatti_submission.csv", row.names = FALSE)
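As a final sanity check one might add, it is worth re-reading the exported CSV and confirming it has exactly the two expected columns and only positive prices. The sketch below uses a toy data frame and a temporary file rather than the real submission.

```r
# toy stand-in for the real submission data frame
toy <- data.frame(id = 1:3, price = c(310000, 455000, 289000))
f <- tempfile(fileext = ".csv")
write.csv(toy, f, row.names = FALSE)

# re-read and confirm the expected shape
check <- read.csv(f)
stopifnot(identical(names(check), c("id", "price")),
          nrow(check) == nrow(toy),
          all(check$price > 0))
```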